
Conversation

@jeynmann (Contributor) commented Nov 6, 2025

What?

Support channel ID in UCX perftest.

Why?

Improve performance by distributing requests across queue pairs (QPs).

How?

Select the channel by thread ID modulo the number of channels (channel_id = threadIdx.x % num_channels).

Summary by CodeRabbit

  • New Features

    • Multi-channel device endpoint support for CUDA performance tests, allowing workloads to be distributed across configurable device channels for more accurate measurements.
  • Chores

    • New channel count parameter added with a default of 1 and initialization added so tests behave consistently when channels are unspecified.

@coderabbitai (bot) commented Nov 6, 2025

Walkthrough

Added a device endpoint channel count parameter and propagated it through perf test initialization into CUDA context and kernel code. CUDA runtime now computes a per-thread channel_id (threadIdx.x % num_channels) and uses it in device_put_* calls; device-side arrays are copied via a new device_vector helper.
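
A minimal sketch of the per-thread channel selection described above; the kernel and parameter struct names below are hypothetical stand-ins, and only the channel_id = threadIdx.x % num_channels pattern comes from the change itself:

    // Illustrative only: demonstrates the channel-selection pattern from the
    // walkthrough. demo_params and demo_send_kernel are made-up names, not
    // the actual perftest code.
    struct demo_params {
        unsigned num_channels; /* populated from device_ep_channel_count */
    };

    __global__ void demo_send_kernel(demo_params params)
    {
        /* Threads are spread round-robin across the configured device channels */
        unsigned channel_id = threadIdx.x % params.num_channels;

        /* The real kernels pass channel_id to device_put_single /
         * device_put_multi / device_put_multi_partial. */
        (void)channel_id;
    }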

Changes

Cohort / File(s) and summary:

  • Public config (src/tools/perf/api/libperf.h): Added unsigned device_ep_channel_count to ucx_perf_params_t.
  • Test initialization (src/tools/perf/perftest.c, src/tools/perf/perftest_params.c): Initialize device_ep_channel_count = 1 in perftest params and set it in adjust_test_params when device send is used.
  • CUDA context (src/tools/perf/cuda/cuda_kernel.cuh): Added unsigned num_channels to ucx_perf_cuda_context and populate it from perf.params.device_ep_channel_count in the constructor.
  • CUDA kernel & runtime (src/tools/perf/cuda/ucp_cuda_kernel.cu): Added num_channels to ucp_perf_cuda_params; initialize it from the perf params; compute channel_id = threadIdx.x % num_channels and pass it to device_put_single, device_put_multi, and device_put_multi_partial; added a device_vector<T> helper and use it to copy indices, offsets, and lengths to the device.
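
A rough sketch of what a device_vector<T> helper with the semantics described above might look like (allocate device memory, copy host data in, free in the destructor); this is an assumption-level illustration, not the actual UCX implementation:

    #include <cuda_runtime.h>
    #include <cstddef>

    // Illustrative helper only: allocate device memory, copy from host,
    // release on destruction. Error handling is omitted for brevity.
    template <typename T>
    class device_vector {
    public:
        device_vector(const T *host_data, size_t count) : m_ptr(nullptr)
        {
            cudaMalloc(&m_ptr, count * sizeof(T));
            cudaMemcpy(m_ptr, host_data, count * sizeof(T),
                       cudaMemcpyHostToDevice);
        }

        ~device_vector()
        {
            cudaFree(m_ptr);
        }

        T *get() const { return m_ptr; } /* device pointer passed to the kernel */

    private:
        T *m_ptr;
    };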

Sequence Diagram

sequenceDiagram
    participant Config as Config (libperf)
    participant Test as Test Init
    participant Host as Host runtime
    participant GPU as CUDA Kernel
    participant NIC as Device ops

    Config->>Test: device_ep_channel_count
    Test->>Host: populate params (num_channels)
    Host->>Host: device_vector(copy indices/offsets/lengths)
    Host->>GPU: launch kernel (params, device arrays)
    GPU->>GPU: channel_id = threadIdx.x % num_channels
    GPU->>NIC: device_put_* (with channel_id)
    NIC-->>GPU: completion

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Review areas needing attention:
    • Correctness of channel_id computation and its usage across all device_put_* call sites.
    • Memory lifetime and error handling in device_vector<T> (allocation/copy/free).
    • Consistency of num_channels propagation between host structs and device-visible params.

Suggested reviewers

  • brminich
  • iyastreb
  • ofirfarjun7

Poem

🐰 New channels hum where threads once played,
I hop through loops where indices are laid.
From config seed to kernel's bright art,
Each thread finds a lane — a channeled heart. ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: you can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: The title 'TOOLS/DEVICE: support channel id in perftest' directly and accurately summarizes the main change: adding channel ID support to the perftest tool.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 84da1e1 and d4803a0.

📒 Files selected for processing (1)
  • src/tools/perf/cuda/ucp_cuda_kernel.cu (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-06T09:04:19.215Z
Learnt from: iyastreb
Repo: openucx/ucx PR: 10906
File: src/tools/perf/cuda/ucp_cuda_kernel.cu:70-91
Timestamp: 2025-11-06T09:04:19.215Z
Learning: In UCX device API (src/ucp/api/device/ucp_device_impl.h), nullptr is a valid and supported value for the ucp_device_request_t* parameter in functions like ucp_device_put_single, ucp_device_put_multi, etc. This is an intentional performance optimization where operations are posted without per-request tracking overhead. The API explicitly handles nullptr in ucp_device_request_init and UCP_DEVICE_SEND_BLOCKING macro.

Applied to files:

  • src/tools/perf/cuda/ucp_cuda_kernel.cu
🧬 Code graph analysis (1)
src/tools/perf/cuda/ucp_cuda_kernel.cu (1)
src/ucp/api/device/ucp_device_impl.h (2)
  • ucp_device_put_single (143-166)
  • ucp_device_put_multi (264-290)
🔇 Additional comments (3)
src/tools/perf/cuda/ucp_cuda_kernel.cu (3)

113-113: LGTM!

The addition of the num_channels field to the parameter structure is appropriate for supporting channel-based routing.


214-217: LGTM!

The device_vector helper correctly allocates device memory and copies host data, with proper cleanup in the destructor. The usage for transferring indices and offsets to the device is appropriate.

Also applies to: 229-237


126-126: No issues found—device_ep_channel_count is always initialized to 1.

Verification confirms all code paths set device_ep_channel_count = 1 before use (perftest.c:224 and perftest_params.c:760). The modulo operation at line 249 cannot cause division by zero with current code. Defensive validation is unnecessary unless the TODO in perftest_params.c (reading channel count from ucp config) is implemented in the future.
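
If that TODO is ever implemented, a defensive clamp along these lines would keep the modulo safe; this is a hypothetical helper, not something present in the PR:

    /* Hypothetical guard: clamp the configured channel count to at least 1 so
     * that threadIdx.x % num_channels can never divide by zero. */
    static unsigned sanitize_channel_count(unsigned requested)
    {
        return (requested == 0) ? 1 : requested;
    }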


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai (bot) left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 87e88f7 and 7355ab4.

📒 Files selected for processing (9)
  • contrib/test_jenkins.sh (1 hunks)
  • contrib/ucx_perftest_config/test_types_ucp_device_cuda (2 hunks)
  • src/tools/perf/api/libperf.h (3 hunks)
  • src/tools/perf/cuda/cuda_kernel.cuh (4 hunks)
  • src/tools/perf/cuda/ucp_cuda_kernel.cu (7 hunks)
  • src/tools/perf/perftest.c (2 hunks)
  • src/tools/perf/perftest.h (1 hunks)
  • src/tools/perf/perftest_params.c (3 hunks)
  • test/gtest/common/test_perf.cc (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/tools/perf/perftest.c (2)
src/tools/perf/lib/ucp_tests.cc (4)
  • params (197-219)
  • params (197-199)
  • params (221-231)
  • params (221-221)
src/ucs/sys/sys.c (1)
  • ucs_get_page_size (530-543)
src/tools/perf/perftest_params.c (1)
src/tools/perf/lib/uct_tests.cc (2)
  • params (118-130)
  • params (118-118)
src/tools/perf/cuda/ucp_cuda_kernel.cu (1)
src/ucp/api/device/ucp_device_impl.h (5)
  • ucs_status_t (83-104)
  • ucp_device_progress_req (441-451)
  • ucp_device_put_single (143-166)
  • ucp_device_put_multi (264-290)
  • ucp_device_put_multi_partial (347-377)

Comment on lines 52 to 66
        if ((threadIdx.x == 0) && ucs_unlikely(completed >= m_next_report_iter)) {
            ucx_perf_cuda_time_t cur_time  = ucx_perf_cuda_get_time_ns();
            ucx_perf_cuda_time_t iter_time = (cur_time - m_last_report_time) /
                                             (completed - m_ctx.completed_iters);
            m_last_report_time             = cur_time;
            m_ctx.completed_iters          = completed;
            __threadfence_system();

            m_next_report_iter = ucs_min(completed + (m_report_interval_ns / iter_time),
                                         m_max_iters);

⚠️ Potential issue | 🔴 Critical

Guard against zero deltas before dividing.

Here we divide by both (completed - m_ctx.completed_iters) and iter_time. If the progress callback fires twice with the same completed (e.g. because m_report_interval_ns / iter_time rounded to zero) or if the GPU clock hasn’t advanced yet, both denominators become zero and the SM traps with a divide-by-zero. Please bail out when delta == 0 and clamp iter_time to at least 1 before the final division.

@@
-        if ((threadIdx.x == 0) && ucs_unlikely(completed >= m_next_report_iter)) {
-            ucx_perf_cuda_time_t cur_time  = ucx_perf_cuda_get_time_ns();
-            ucx_perf_cuda_time_t iter_time = (cur_time - m_last_report_time) /
-                                             (completed - m_ctx.completed_iters);
+        if ((threadIdx.x == 0) && ucs_unlikely(completed >= m_next_report_iter)) {
+            ucx_perf_counter_t delta_iters = completed - m_ctx.completed_iters;
+            if (delta_iters == 0) {
+                return;
+            }
+
+            ucx_perf_cuda_time_t cur_time  = ucx_perf_cuda_get_time_ns();
+            ucx_perf_cuda_time_t delta_time = cur_time - m_last_report_time;
+            if (delta_time == 0) {
+                delta_time = 1;
+            }
+
+            ucx_perf_cuda_time_t iter_time = ucs_max<ucx_perf_cuda_time_t>(delta_time / delta_iters, 1);
             m_last_report_time             = cur_time;
             m_ctx.completed_iters          = completed;
             __threadfence_system();
 
-            m_next_report_iter = ucs_min(completed + (m_report_interval_ns / iter_time),
-                                         m_max_iters);
+            ucx_perf_cuda_time_t next_interval = m_report_interval_ns / iter_time;
+            if (next_interval == 0) {
+                next_interval = 1;
+            }
+            m_next_report_iter = ucs_min(completed + next_interval, m_max_iters);
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        if ((threadIdx.x == 0) && ucs_unlikely(completed >= m_next_report_iter)) {
-            ucx_perf_cuda_time_t cur_time  = ucx_perf_cuda_get_time_ns();
-            ucx_perf_cuda_time_t iter_time = (cur_time - m_last_report_time) /
-                                             (completed - m_ctx.completed_iters);
-            m_last_report_time             = cur_time;
-            m_ctx.completed_iters          = completed;
-            __threadfence_system();
-
-            m_next_report_iter = ucs_min(completed + (m_report_interval_ns / iter_time),
-                                         m_max_iters);
+        if ((threadIdx.x == 0) && ucs_unlikely(completed >= m_next_report_iter)) {
+            ucx_perf_counter_t delta_iters = completed - m_ctx.completed_iters;
+            if (delta_iters == 0) {
+                return;
+            }
+
+            ucx_perf_cuda_time_t cur_time   = ucx_perf_cuda_get_time_ns();
+            ucx_perf_cuda_time_t delta_time = cur_time - m_last_report_time;
+            if (delta_time == 0) {
+                delta_time = 1;
+            }
+
+            ucx_perf_cuda_time_t iter_time = ucs_max<ucx_perf_cuda_time_t>(delta_time / delta_iters, 1);
+            m_last_report_time             = cur_time;
+            m_ctx.completed_iters          = completed;
+            __threadfence_system();
+
+            ucx_perf_cuda_time_t next_interval = m_report_interval_ns / iter_time;
+            if (next_interval == 0) {
+                next_interval = 1;
+            }
+            m_next_report_iter = ucs_min(completed + next_interval, m_max_iters);
+        }
🤖 Prompt for AI Agents
In src/tools/perf/cuda/cuda_kernel.cuh around lines 52 to 61, the code can
divide by zero when computing per-iteration times because both delta =
(completed - m_ctx.completed_iters) and iter_time can be zero; compute and store
delta first, if delta == 0 bail out/skip the reporting block to avoid dividing
by zero, then compute iter_time and clamp it to at least 1 (e.g., use a max with
1) before using it in any division or in calculating m_next_report_iter so no
division by zero occurs.

Comment on lines 83 to 89
        if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
            req   = nullptr;
            flags = static_cast<ucp_device_flags_t>(0);
        } else {
            req            = &m_requests[index];
            m_pending_map |= UCS_BIT(index);
        }

⚠️ Potential issue | 🔴 Critical

Do not hand out a null request under flow control

When get_request() returns req == nullptr, the very next call to ucp_perf_cuda_send_async() passes that pointer straight into ucp_device_put_single/multi/multi_partial, all of which dereference req unconditionally. With device_fc_window > 1, we therefore crash every time this branch fires. Please always return a valid ucp_device_request_t* here and keep the existing bitmap semantics for the FC “prefetch” case.

-        if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
-            req   = nullptr;
-            flags = static_cast<ucp_device_flags_t>(0);
-        } else {
-            req            = &m_requests[index];
-            m_pending_map |= UCS_BIT(index);
-        }
+        req = &m_requests[index];
+        if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
+            flags = static_cast<ucp_device_flags_t>(0);
+        } else {
+            m_pending_map |= UCS_BIT(index);
+        }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
-            req   = nullptr;
-            flags = static_cast<ucp_device_flags_t>(0);
-        } else {
-            req            = &m_requests[index];
-            m_pending_map |= UCS_BIT(index);
-        }
+        req = &m_requests[index];
+        if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
+            flags = static_cast<ucp_device_flags_t>(0);
+        } else {
+            m_pending_map |= UCS_BIT(index);
+        }

Comment on lines 728 to 731
    case 'F':
        return parse_int(opt_arg, &params->super.device_fc_window,
                         "device flow control window size", 1, INT_MAX);
    case 'y':

⚠️ Potential issue | 🔴 Critical

Fix the -F parser’s target type.

parse_int() expects an int*, but we are handing it an unsigned*. With -Werror (the default in CI) this is an incompatible pointer type error, so the whole build breaks. Please parse into a local int and assign it back once validation succeeds.

@@
-    case 'F':
-        return parse_int(opt_arg, &params->super.device_fc_window,
-                         "device flow control window size", 1, INT_MAX);
+    case 'F': {
+        int fc_window;
+
+        status = parse_int(opt_arg, &fc_window,
+                           "device flow control window size", 1, INT_MAX);
+        if (status == UCS_OK) {
+            params->super.device_fc_window = fc_window;
+        }
+        return status;
+    }
🤖 Prompt for AI Agents
In src/tools/perf/perftest_params.c around lines 728 to 731, the -F option
passes an unsigned int pointer to parse_int which expects an int*, causing an
incompatible pointer type error under -Werror; fix it by parsing into a local
int variable (call parse_int with &local_int and the same bounds), check
parse_int succeeded, then assign local_int to params->super.device_fc_window
(cast/convert to the unsigned field) after validation; ensure you preserve the
same bounds and error handling behavior.


    void init_params(const ucx_perf_context_t &perf)
    {
        m_params.num_channels = perf.params.device_ep_channel_count;
Contributor


Maybe just set it in ucp_perf_cuda_params_handler and remove this func.
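
A rough sketch of that suggestion, using stand-in struct names (the real code uses ucx_perf_context_t and ucp_perf_cuda_params):

    /* Hypothetical: assign num_channels where the device params are built,
     * so a separate init_params() helper becomes unnecessary. */
    struct demo_perf_params { unsigned device_ep_channel_count; };
    struct demo_cuda_params { unsigned num_channels; };

    static void demo_cuda_params_handler(const demo_perf_params &perf_params,
                                         demo_cuda_params &cuda_params)
    {
        cuda_params.num_channels = perf_params.device_ep_channel_count;
    }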

@jeynmann jeynmann changed the title [DNM][WIP] TOOLS/DEVICE: support channel id in perftest TOOLS/DEVICE: support channel id in perftest Nov 14, 2025